A probability box (or p-box) is a characterization of an uncertain number consisting of both aleatoric and epistemic uncertainties that is often used in risk analysis or quantitative uncertainty modeling where numerical calculations must be performed. Probability bounds analysis is used to make arithmetic and logical calculations with p-boxes.
An example p-box is shown in the figure at right for an uncertain number x consisting of a left (upper) bound and a right (lower) bound on the probability distribution for x. The bounds are coincident for values of x below 0 and above 24. The bounds may have almost any shapes, including step functions, so long as they are monotonically increasing and do not cross each other. A p-box is used to express simultaneously incertitude (epistemic uncertainty), which is represented by the breadth between the left and right edges of the p-box, and variability (aleatory uncertainty), which is represented by the overall slant of the p-box.
Contents |
There are dual interpretations of a p-box. It can be understood as bounds on the cumulative probability associated with any x-value. For instance, in the p-box depicted at right, the probability that the value will be 2.5 or less is between 4% and 36%. A p-box can also be understood as bounds on the x-value at any particular probability level. In the example, the 95th percentile is sure to be between 9 and 16.
If the left and right bounds of a p-box are sure to enclose the unknown distribution, the bounds are said to be rigorous, or absolute. The bounds may also be the tightest possible such bounds on the distribution function given the available information about it, in which case the bounds are therefore said to be best-possible. It may commonly be the case, however, that not every distribution that lies within these bounds is a possible distribution for the uncertain number, even when the bounds are rigorous and best-possible.
P-boxes are specified by left and right bounds on the cumulative probability distribution function (or, equivalently, the survival function) of a quantity and, optionally, additional information about the quantity’s mean, variance and distributional shape (family, unimodality, symmetry, etc.). A p-box represents a class of probability distributions consistent with these constraints.
Let 𝔻 denote the space of distribution functions on the real numbers ℝ, i.e., 𝔻 = {D | D : ℝ → [0,1], D(x) ≤ D(y) whenever x < y, for all x, y ∈ ℝ}, and let 𝕀 denote the set of real intervals, i.e., 𝕀 = {i | i = [i1, i2], i1 ≤ i2, i1, i2 ∈ ℝ}. Then a p-box is a quintuple {F, F, m, v, F}, where F, F ∈ 𝔻, while m, v ∈ 𝕀, and F ⊆ 𝔻. This quintuple denotes the set of distribution functions F ∈ 𝔻 matching the following constraints:
Thus, the constraints are that the distribution function F falls within prescribed bounds, the mean of the distribution (given by the Riemann-Stieltjes integral) is in the interval m, the variance of the distribution is in the interval v, and the distribution is within some admissible class of distributions F. The Riemann-Stieltjes integrals do not depend on the differentiability of F.
P-boxes serve the same role for random variables that upper and lower probabilities serve for events. In robust Bayes analysis[1] a p-box is also known as a distribution band.[2][3] A p-box can be constructed as a closed neighborhood of a distribution F ∈ 𝔻 under the Kolmogorov, Lévy or Wasserstein metric. A p-box is a crude but computationally convenient kind of credal set. Whereas a credal set is defined solely in terms of the constraint F as a convex set of distributions (which automatically determine F, F, m, and v, but are often very difficult to compute with), a p-box usually has a loosely constraining specification of F, or even no constraint so that F = 𝔻. Calculations with p-boxes, unlike credal sets, are often quite efficient, and algorithms for all standard mathematical functions are known.
A p-box is minimally specified by its left and right bounds, in which case the other constraints are understood to be vacuous as {F, F, [– , ], [0, ], 𝔻}. Even when these ancillary constraints are vacuous, there may still be nontrivial bounds on the mean and variance that can be inferred from the left and right edges of the p-box.
P-boxes may arise from a variety of kinds of incomplete information about a quantity, and there are several ways to obtain p-boxes from data and analytical judgment.
When a probability distribution is known to have a particular shape (e.g., normal, uniform, beta, Weibull, etc.) but its parameters can only be specified imprecisely as intervals, the result is called a distributional p-box, or sometimes a parametric p-box. Such a p-box is usually easy to obtain by enveloping extreme distributions given the possible parameters. For instance, if a quantity is known to be normal with mean somewhere in the interval [7,8] and standard deviation within the interval [1,2], the left and right edges of the p-box can be found by enveloping the distribution functions of four probability distributions, namely, normal(7,1), normal(8,1), normal(7,2), and normal(8,2), where normal(μ,σ) represents a normal distribution with mean μ and standard deviation σ. All probability distributions that are normal and have means and standard deviations inside these respective intervals will have distribution functions that fall entirely within this p-box. The left and right bounds enclose many non-normal distributions, but these would be excluded from the p-box by specifying normality as the distribution family.
Even if the parameters such as mean and variance of a distribution are known precisely, the distribution cannot be specified precisely if the distribution family is unknown. In such situations, envelopes of all distributions matching given moments can be constructed from inequalities such as those due to Markov, Chebyshev, Cantelli, or Rowe[4][5] that enclose all distribution functions having specified parameters. These define distribution-free p-boxes because they make no assumption whatever about the family or shape of the uncertain distribution. When qualitative information is available, such as that the distribution is unimodal, the p-boxes can often be tightened substantially.[6]
When all members of a population can be measured, or when random sample data are abundant, analysts often use an empirical distribution to summarize the values. When those data have non-negligible measurement uncertainty represented by interval ranges about each sample value, an empirical distribution may be generalized to a p-box.[7] Such a p-box can be specified by cumulating the lower endpoints of all the interval measurements into a cumulative distribution forming the left edge of the p-box, and cumulating the upper endpoints to form the right edge. The broader the measurement uncertainty, the wider the resulting p-box.
Interval measurements can also be used to generalize distributional estimates based on the method of matching moments or maximum likelihood, that make shape assumptions such as normality or lognormality, etc.[7][8] Although the measurement uncertainty can be treated rigorously, the resulting distributional p-box generally will not be rigorous when it is a sample estimate based on only a subsample of the possible values. But, because these calculations take account of the dependence between the parameters of the distribution, they will often yield tighter p-boxes than could be obtained by treating the interval estimates of the parameters as unrelated as is done for distributional p-boxes.
There may be uncertainty about the shape of a probability distribution because the sample size of the empirical data characterizing it is small. Several methods in traditional statistics have been proposed to account for this sampling uncertainty about the distribution shape, including Kolmogorov-Smirnov[9] and similar[10] confidence bands, which are distribution-free in the sense that they make no assumption about the shape of the underlying distribution. There are related confidence-band methods that do make assumptions about the shape or family of the underlying distribution, which can often result in tighter confidence bands.[11][12][13] Constructing confidence bands requires one to select the probability defining the confidence level, which usually must be less than 100% for the result to be non-vacuous. Confidence bands at the (1−α)% confidence level are defined such that, (1−α)% of the time they are constructed, they will completely enclose the distribution from which the data were randomly sampled. A confidence band about a distribution function is sometimes used as a p-box even though it represents statistical rather than rigorous or sure bounds. This use implicitly assumes that the true distribution, whatever it is, is inside the p-box.
An analogous Bayesian structure is called a Bayesian p-box,[14] which encloses all distributions having parameters within a subset of parameter space corresponding to some specified probability level from a Bayesian analysis of the data. This subset is the credible region for the parameters given the data, which could be defined as the highest posterior probability density region, or the lowest posterior loss region, or in some other suitable way. To construct a Bayesian p-box one must select a prior distribution, in addition to specifying the credibility level (analogous to a confidence level).
P-boxes can arise from computations involving probability distributions, or involving both a probability distribution and an interval, or involving other p-boxes. For example, the sum of a quantity represented by a probability distribution and a quantity represented by an interval will generally be characterized by a p-box.[15] The sum of two random variables characterized by well-specified probability distributions is another precise probability distribution typically only when the copula (dependence function) between the two summands is completely specified. When their dependence is unknown or only partially specified, the sum will be more appropriately represented by a p-box because different dependence relations lead to many different distributions for the sum. Kolmogorov originally asked what bounds could be placed about the distribution of a sum when nothing is known about the dependence between the distributions of the addends.[16] The question was only answered in the early 1980s. Since that time, formulas and algorithms for sums have been generalized and extended to differences, products, quotients and other binary and unary functions under various dependence assumptions.[16][17][18][19][20][21][22]
These methods, collectively called probability bounds analysis, provide algorithms to evaluate mathematical expressions when there is uncertainty about the input values, their dependencies, or even the form of mathematical expression itself. The calculations yield results that are guaranteed to enclose all possible distributions of the output variable if the input p-boxes were also sure to enclose their respective distributions. In some cases, a calculated p-box will also be best-possible in the sense that only possible distributions are within the p-box, but this is not always guaranteed. For instance, the set of probability distributions that could result from adding random values without the independence assumption from two (precise) distributions is generally a proper subset of all the distributions admitted by the computed p-box. That is, there are distributions within the output p-box that could not arise under any dependence between the two input distributions. The output p-box will, however, always contain all distributions that are possible, so long as the input p-boxes were sure to enclose their respective underlying distributions. This property often suffices for use in risk analysis.
Precise probability distributions and intervals are special cases of p-boxes, as are real values and integers. Because a probability distribution expresses variability and lacks incertitude, the left and right bounds of its p-box are coincident for all x-values at the value of the cumulative distribution function (which is a non-decreasing function from zero to one). Mathematically, a probability distribution F is the degenerate p-box {F, F, E(F), V(F), F}, where E and V denote the expectation and variance operators. An interval expresses only incertitude. Its p-box looks like a rectangular box whose upper and lower bounds jump from zero to one at the endpoints of the interval. Mathematically, an interval [a, b] corresponds to the degenerate p-box {H(a), H(b), [a, b], [0, (b–a)2/4], 𝔻}, where H denotes the Heaviside step function. A precise scalar number c lacks both kinds of uncertainty. Its p-box is just a step function from 0 to 1 at the value c; mathematically this is {H(c), H(c), c, 0, H(c)}.
P-boxes and probability bounds analysis have been used in many applications spanning many disciplines in engineering and environmental science, including
No internal structure. Because a p-box retains little information about any internal structure within the bounds, it does not elucidate which distributions within the p-box are most likely, nor whether the edges represent very unlikely or distinctly likely scenarios. This could complicate decisions in some cases if an edge of a p-box encloses a decision threshold.
Loses information. To achieve computational efficiency, p-boxes lose information compared to more complex Dempster-Shafer structures or credal sets.[50] In particular, p-boxes lose information about the mode (most probable value) of a quantity. This information could be useful to keep, especially in situations where the quantity is an unknown but fixed value.
Traditional probability sufficient. Some critics of p-boxes argue that precisely specified probability distributions are sufficient to characterize uncertainty of all kinds. For instance, Lindley has asserted, "Whatever way uncertainty is approached, probability is the only sound way to think about it."[51][52] These critics argue that it is meaningless to talk about ‘uncertainty about probability’ and that traditional probability is a complete theory that is sufficient to characterize all forms of uncertainty. Under this criticism, users of p-boxes have simply not made the requisite effort to identify the appropriate precisely specified distribution functions.
Possibility theory can do better. Some critics contend that it makes sense in some cases to work with a possibility distribution rather than working separately with the left and right edges of p-boxes. They argue that the set of probability distributions induced by a possibility distribution is a subset of those enclosed by an analogous p-box's edges.[53][54] Others make a counterargument that one cannot do better with a possibility distribution than with a p-box.[55]
Baudrit, C., and D. Dubois (2006). Practical representations of incomplete probabilistic knowledge. Computational Statistics & Data Analysis 51: 86-108.
Baudrit, C., D. Dubois, D. Guyonnet (2006). Joint propagation and exploitation of probabilistic and possibilistic information in risk assessment. IEEE Transactions on Fuzzy Systems 14: 593-608.
Bernardini, A., and F. Tonon (2009). Extreme probability distributions of random/fuzzy sets and p-boxes. International Journal of Reliability and Safety 3: 57-78. (alternative link)
Destercke, S., D. Dubois and E. Chojnacki (2008). Unifying practical uncertainty representations – I: Generalized p-boxes. International Journal of Approximate Reasoning 49: 649-663 .
Dubois, D. (2010). (Commentary) Representation, propagation, and decision issues in risk analysis under incomplete probabilistic information. Risk Analysis 30: 361-368. DOI: 10.1111/j.1539-6924.2010.01359.x.
Dubois, D., and D. Guyonnet (2011). Risk-informed decision-making in the presence of epistemic uncertainty. International Journal of General Systems 40: 145-167.
Guyonnet, D., F. Blanchard, C. Harpet, Y. Ménard, B. Côme and C. Baudrit (2005). Projet IREA—Traitement des incertitudes en évaluation des risques d'exposition. Rapport BRGM/RP-54099-FR, Bureau de Recherches Géologiques et Minières, France.